How Noisy Social Media Text, How Diffrnt Social Media Sources?

Authors

  • Timothy Baldwin
  • Paul Cook
  • Marco Lui
  • Andrew MacKinlay
  • Li Wang
Abstract

While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, which we compare to a reference corpus of edited English text. We first extract various descriptive statistics from each data type (including the distribution of languages, average sentence length and proportion of out-of-vocabulary words), and then investigate the proportion of grammatical sentences in each, based on a linguistically-motivated parser. We also investigate the relative similarity between different data types.

Similar resources

Baldwin, Timothy, Paul Cook, Marco Lui, Andrew MacKinlay and Li Wang (to appear) How Noisy Social Media Text, How Diffrnt Social Media Sources?, In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, Japan

While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia,...

The "what", "why", and "how" of Mining Noisy Data

Uncertain data is expected to comprise almost 80% of all available data in the near future. The problem of uncertain or noisy data is particularly severe in many of the newer sources of data such as Internet blogs, social media sites, mobile devices, and sensors. There is a strong need to devise new approaches in order to extract useful insights and intelligence from the abundant and rich infor...

A Study on the Use of Social Media to Understand Consumer Preference: The Case of Starbucks

The paper seeks to identify Starbucks' experience in using social media, understand how social media is linked to customer knowledge management, and assess how social media services could have contributed to Starbucks' success. Starbucks demonstrates versatility in engaging customers and supporting different parts of its customer knowledge management strategy through various social media platforms, such a...

A Lexicon Based Algorithm for Noisy Text Normalization as Pre-processing for Sentiment Analysis

Sentiment analysis, in the most general sense, refers to the classification of a piece of text into one of three classes (positive, negative or neutral) according to its polarity. The text may be an entire document, a paragraph, a sentence, a phrase or even a single word. Most of the literature on sentiment analysis is dedicated to well-formed text as found in the newspapers, journals and ma...

Analyzing Content and Customer Engagement in Social Media with Deep Learning

In the present study, we investigate the effect of social media content on subsequent customer engagement (likes and reblogs) using a large-scale dataset from Tumblr. Our study focuses on company-generated posts, which consist of two main information sources: visual (images) and textual (text and tags). We employ state-of-the-art machine learning approaches including deep learning to extract dat...

Publication date: 2013